Lecture 04
dukestm package

This is a companion package for the course where I will be putting useful functions for some of our common tasks.

This is not an official or polished package, but I've tried to include documentation for most functions. If you notice a problem, open an issue or send me an email.
To install,
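The install command itself did not survive in these notes; assuming the package is distributed through GitHub (the repository path below is a placeholder, not the confirmed location), installation would look something like:

```r
# install.packages("remotes")   # if remotes is not already installed
# NOTE: placeholder repository path -- check the course page for the real one
remotes::install_github("<github-user>/dukestm")
```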
Updates will be made throughout the semester and I will try to remind you when something new is available; rerunning the above will get you the latest version.
epred_draws_fix()

In the first output below the .chain and .iteration columns come back as NA; epred_draws_fix() recovers these indices (second output):

# A tibble: 400,000 × 7
# Groups: x, y, .row [100]
x y .row .chain .iteration .draw
<int> <dbl> <int> <int> <int> <int>
1 1 -3.24 1 NA NA 1
2 1 -3.24 1 NA NA 2
3 1 -3.24 1 NA NA 3
4 1 -3.24 1 NA NA 4
5 1 -3.24 1 NA NA 5
6 1 -3.24 1 NA NA 6
7 1 -3.24 1 NA NA 7
8 1 -3.24 1 NA NA 8
9 1 -3.24 1 NA NA 9
10 1 -3.24 1 NA NA 10
# ℹ 399,990 more rows
# ℹ 1 more variable: .epred <dbl>
# A tibble: 400,000 × 7
x y .row .chain .iteration .draw
<int> <dbl> <int> <int> <int> <int>
1 1 -3.24 1 1 1 1
2 1 -3.24 1 1 2 2
3 1 -3.24 1 1 3 3
4 1 -3.24 1 1 4 4
5 1 -3.24 1 1 5 5
6 1 -3.24 1 1 6 6
7 1 -3.24 1 1 7 7
8 1 -3.24 1 1 8 8
9 1 -3.24 1 1 9 9
10 1 -3.24 1 1 10 10
# ℹ 399,990 more rows
# ℹ 1 more variable: .epred <dbl>
( dukestm::epred_draws_fix(b, newdata=d) |>
group_by(x, y) |>
tidybayes::mean_hdi(
.epred, .width = c(0.5,0.9,0.95)
) |>
mutate(contains = y >= .lower & y <= .upper) |>
group_by(prob = .width) |>
summarize(
emp_cov = sum(contains)/n()
)
)

# A tibble: 3 × 2
prob emp_cov
<dbl> <dbl>
1 0.5 0.02
2 0.9 0.11
3 0.95 0.14

These are intervals for the posterior mean \(E(Y|X)\), not for new observations, so they are expected to undercover the observed \(y\) values; the posterior predictive intervals below are the appropriate comparison.
predicted_draws_fix(b, newdata=d) |>
group_by(x, y) |>
tidybayes::mean_hdi(
.prediction, .width = c(0.5,0.8,0.9,0.95)
) |>
mutate(contains = y >= .lower & y <= .upper) |>
group_by(prob = .width) |>
summarize(
emp_cov = sum(contains)/n()
)

# A tibble: 4 × 2
prob emp_cov
<dbl> <dbl>
1 0.5 0.42
2 0.8 0.81
3 0.9 0.95
4 0.95 0.97
# A tibble: 4 × 2
.chain `mean(.estimate)`
<int> <dbl>
1 1 1.54
2 2 1.54
3 3 1.54
4 4 1.54
Well, it looks like things are going up on average …
Well, there is some periodicity; let's add the month (as a factor) …
There is still some long-term trend in the data; maybe a fancy polynomial can help …
Call:
lm(formula = co2 ~ date + month + poly(date, 5), data = mauna_loa)
Residuals:
Min 1Q Median 3Q Max
-0.72022 -0.19169 -0.00638 0.17565 1.06026
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.587e+03 1.460e+01 -177.174 < 2e-16 ***
date 1.479e+00 7.334e-03 201.649 < 2e-16 ***
monthAug -4.155e+00 1.346e-01 -30.880 < 2e-16 ***
monthDec -3.566e+00 1.350e-01 -26.404 < 2e-16 ***
monthFeb -2.022e+00 1.345e-01 -15.041 < 2e-16 ***
monthJan -2.729e+00 1.345e-01 -20.286 < 2e-16 ***
monthJul -2.018e+00 1.345e-01 -15.003 < 2e-16 ***
monthJun -3.136e-01 1.345e-01 -2.332 0.021117 *
monthMar -1.233e+00 1.344e-01 -9.175 5.54e-16 ***
monthMay 4.881e-01 1.344e-01 3.631 0.000396 ***
monthNov -4.799e+00 1.349e-01 -35.577 < 2e-16 ***
monthOct -6.102e+00 1.348e-01 -45.282 < 2e-16 ***
monthSep -6.036e+00 1.346e-01 -44.832 < 2e-16 ***
poly(date, 5)1 NA NA NA NA
poly(date, 5)2 -1.920e+00 3.427e-01 -5.602 1.09e-07 ***
poly(date, 5)3 3.920e+00 3.451e-01 11.358 < 2e-16 ***
poly(date, 5)4 8.946e-01 3.428e-01 2.609 0.010062 *
poly(date, 5)5 -4.340e+00 3.462e-01 -12.535 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3427 on 139 degrees of freedom
Multiple R-squared: 0.997, Adjusted R-squared: 0.9966
F-statistic: 2872 on 16 and 139 DF, p-value: < 2.2e-16
| Model | rmse |
|---|---|
| co2 ~ date | 2.248 |
| co2 ~ month | 5.566 |
| co2 ~ date + month | 0.594 |
| co2 ~ poly(date, 5) | 2.171 |
| co2 ~ month + poly(date, 5) | 0.323 |
| co2 ~ date + month + poly(date, 5) | 0.323 |
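A sketch of how a table like this can be generated, using base R's built-in `co2` series (monthly, 1959–1997) as a stand-in for the course's `mauna_loa` data, so the RMSE values will not match the ones above:

```r
# Stand-in for the course's mauna_loa data frame, built from base R's co2 ts
mauna_loa <- data.frame(
  co2   = as.numeric(co2),
  date  = as.numeric(time(co2)),
  month = factor(month.abb[cycle(co2)])
)

forms <- list(
  co2 ~ date, co2 ~ month, co2 ~ date + month,
  co2 ~ poly(date, 5), co2 ~ month + poly(date, 5),
  co2 ~ date + month + poly(date, 5)
)

# in-sample RMSE for each candidate model
rmse <- sapply(forms, function(f) sqrt(mean(resid(lm(f, data = mauna_loa))^2)))
data.frame(model = sapply(forms, function(f) deparse(f)), rmse = round(rmse, 3))
```

Note the date + month + poly(date, 5) model drops a coefficient (`date` is collinear with the first polynomial term), which is why it fits no better than month + poly(date, 5).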
A generalized linear model has three key components:

- a probability distribution (from the exponential family) that describes the response variable,
- a linear predictor \(\boldsymbol{\eta} = \boldsymbol{X}\boldsymbol{\beta}\),
- and a link function \(g\) such that \(g(E(\boldsymbol{Y}|\boldsymbol{X})) = \boldsymbol{\eta}\) (or, equivalently, \(E(\boldsymbol{Y}|\boldsymbol{X}) = g^{-1}(\boldsymbol{\eta})\)).
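A minimal sketch of these three pieces with a Poisson family and log link (simulated data, not from the course materials):

```r
set.seed(1)
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))  # counts with a log-linear mean

fit <- glm(y ~ x, family = poisson(link = "log"))

eta <- predict(fit, type = "link")       # linear predictor eta = X beta
mu  <- predict(fit, type = "response")   # E(Y|X) = g^{-1}(eta) = exp(eta)
all.equal(unname(mu), unname(exp(eta)))  # TRUE: inverse link applied to eta
```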
This is a special case of a generalized linear model for count data, where we assume the outcome variable follows a Poisson distribution (so its mean equals its variance).
\[ \begin{aligned} Y_i &\sim \text{Poisson}(\lambda_i)\\ \log E(Y_i|\boldsymbol{X}_{i\cdot}) &= \log{\lambda_i} = \underset{1 \times p}{\boldsymbol{X}_{i\cdot}}\underset{p \times 1}{\boldsymbol{\beta}} \end{aligned} \]
These data represent the total number of new AIDS cases reported in Belgium during the early stages of the epidemic.
The naive approach is to use standard residuals,
\[ r_i = Y_i - E(Y_i|X) = Y_i - \hat\lambda_i\]
Pearson residuals: \[ r_i = \frac{Y_i - E(Y_i|X)}{\sqrt{Var(Y_i|X)}} = \frac{Y_i - \hat\lambda_i}{\sqrt{\hat\lambda_i}}\]
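Both residual types can be computed directly from the fitted means; a sketch on simulated data (base R's `residuals()` with the matching `type` argument returns the same values):

```r
set.seed(2)
x <- 1:20
y <- rpois(20, lambda = exp(0.1 * x))
fit <- glm(y ~ x, family = poisson)

lambda_hat <- unname(fitted(fit))                  # hat(lambda)_i = E(Y_i | X)
r_raw      <- y - lambda_hat                       # naive (response) residuals
r_pearson  <- (y - lambda_hat) / sqrt(lambda_hat)  # Pearson residuals

all.equal(r_raw,     unname(residuals(fit, type = "response")))  # TRUE
all.equal(r_pearson, unname(residuals(fit, type = "pearson")))   # TRUE
```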
Deviance is a way of measuring the difference between a GLM’s fit and the fit of a perfect model (i.e. where \(\theta_{best} = E(Y_i|X) = Y_i\)).
It is defined as twice the log of the ratio between the likelihood of the perfect (saturated) model and the likelihood of the given model, \[ \begin{aligned} D &= 2\log\left( \frac{\mathcal{L}(\theta_{best}|Y)} { \mathcal{L}(\hat\theta|Y)}\right) \\ &= 2\big(\ell(\theta_{best}|Y) - \ell(\hat\theta|Y)\big) \end{aligned} \]
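For the Poisson likelihood the saturated model sets \(\hat\lambda_i = y_i\), and the definition above reduces to a closed form that can be checked against `deviance()` (simulated data, not the course's):

```r
set.seed(3)
x <- 1:30
y <- rpois(30, lambda = exp(0.15 * x))
fit <- glm(y ~ x, family = poisson)
mu  <- unname(fitted(fit))

# y * log(y / mu) is taken to be 0 when y = 0 (its limiting value)
D <- 2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
all.equal(D, deviance(fit))  # TRUE
```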
Call:
glm(formula = cases ~ year, family = poisson, data = aids)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.971e+02 1.546e+01 -25.68 <2e-16 ***
year 2.021e-01 7.771e-03 26.01 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 872.206 on 12 degrees of freedom
Residual deviance: 80.686 on 11 degrees of freedom
AIC: 166.37
Number of Fisher Scoring iterations: 4
We can therefore think of deviance as \(D = \sum_{i=1}^n d_i^2\) where \(d_i\) is a generalized residual.
In the Poisson case we have, \[ d_i = \text{sign}(y_i - \hat\lambda_i) \sqrt{2\big(y_i \log (y_i/\hat\lambda_i) - (y_i-\hat\lambda_i)\big)}\]
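This formula can be verified against base R on simulated data (with the convention that \(y_i \log(y_i/\hat\lambda_i) = 0\) when \(y_i = 0\)):

```r
set.seed(4)
x <- 1:25
y <- rpois(25, lambda = exp(0.1 * x))
fit <- glm(y ~ x, family = poisson)
mu  <- unname(fitted(fit))

term <- ifelse(y == 0, 0, y * log(y / mu))   # limiting value 0 at y = 0
d    <- sign(y - mu) * sqrt(2 * (term - (y - mu)))

all.equal(d, unname(residuals(fit, type = "deviance")))  # TRUE
all.equal(sum(d^2), deviance(fit))                       # TRUE: D = sum d_i^2
```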
Family: poisson
Links: mu = log
Formula: cases ~ year
Data: aids (Number of observations: 13)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -396.93 15.47 -427.20 -366.60 1.00 1502 2091
year 0.20 0.01 0.19 0.22 1.00 1504 2091
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
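For reference, a summary like the one above would come from a call along these lines (inferred from the printout, not the exact code from the slides; running it requires a working Stan toolchain):

```r
library(brms)

g_bayes <- brm(
  cases ~ year, family = poisson(),
  data = aids,                         # the aids data frame used earlier
  chains = 4, iter = 2000, warmup = 1000
)
summary(g_bayes)
```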
# A tibble: 52,000 × 7
year cases .row .chain .iteration .draw .residual
<int> <int> <int> <int> <int> <int> <int>
1 1981 12 1 1 1 1 -18
2 1981 12 1 1 2 2 -21
3 1981 12 1 1 3 3 -19
4 1981 12 1 1 4 4 -12
5 1981 12 1 1 5 5 -17
6 1981 12 1 1 6 6 -11
7 1981 12 1 1 7 7 -13
8 1981 12 1 1 8 8 -21
9 1981 12 1 1 9 9 -18
10 1981 12 1 1 10 10 -5
# ℹ 51,990 more rows
predicted_draws_fix(g_bayes, newdata = aids) |>
group_by(.chain, .row) |>
summarize(
rmse = yardstick::rmse_vec(cases, .prediction),
crps = dukestm::calc_crps(.prediction, cases)
) |>
group_by(.chain) |>
summarize(
rmse = mean(rmse),
crps = mean(crps)
)

# A tibble: 4 × 3
.chain rmse crps
<int> <dbl> <dbl>
1 1 26.3 17.7
2 2 26.2 17.6
3 3 26.3 17.8
4 4 26.3 17.7
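`calc_crps()` above comes from the course package; as background, a standard sample-based estimator of the continuous ranked probability score (my own sketch, not necessarily `dukestm`'s implementation) is:

```r
# CRPS estimated from posterior draws x for one observed value y:
#   E|X - y| - (1/2) E|X - X'|   (lower is better)
crps_sample <- function(x, y) {
  mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, "-")))
}

crps_sample(rep(5, 100), 5)  # 0: all draws exactly on the observation
crps_sample(c(4, 6), 5)      # 0.5
```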
predicted_draws_fix(g_bayes, newdata = aids) |>
group_by(.row, cases) |>
ggdist::mean_hdi(
.prediction, .width = c(0.5,0.9,0.95)
) |>
mutate(contains = cases >= .lower & cases <= .upper) %>%
group_by(.width) |>
summarize(
emp_cov = sum(contains)/n()
)

# A tibble: 3 × 2
.width emp_cov
<dbl> <dbl>
1 0.5 0.231
2 0.9 0.385
3 0.95 0.462

Even the posterior predictive intervals undercover badly here, which is consistent with the large residual deviance (80.7 on 11 degrees of freedom): the Poisson model is too restrictive for these data.
Sta 344/644 - Fall 2023